unseen attack
CausalDiff: Causality-Inspired Disentanglement via Diffusion Model for Adversarial Defense
Despite ongoing efforts to defend neural classifiers from adversarial attacks, they remain vulnerable, especially to unseen attacks. In contrast, humans are hard to fool with subtle manipulations, since we make judgments based only on essential factors. Inspired by this observation, we attempt to model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation. For an adversarial example, we aim to identify the perturbations as non-causative factors and make predictions based only on the label-causative factors. Concretely, we propose a causal diffusion model (CausalDiff) that adapts diffusion models for conditional data generation and disentangles the two types of causal factors by learning toward a novel causal information bottleneck objective. Empirically, CausalDiff significantly outperforms state-of-the-art defense methods on various unseen attacks, achieving an average robustness of 86.39% (+4.01%) on CIFAR-10, 56.25% (+3.13%) on CIFAR-100, and 82.62% (+4.93%) on GTSRB (German Traffic Sign Recognition Benchmark).
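For intuition only, here is a minimal PyTorch-style sketch of the disentanglement idea, not the authors' implementation: an encoder splits an input into a label-causative factor s and a non-causative factor z; a toy conditional denoiser uses both to reconstruct the data, while only s feeds the classifier. The encoder/denoiser architectures, the cosine noise schedule, and the loss weight `lam` are illustrative assumptions, and `clf` is a user-supplied classifier head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Encodes an image into a label-causative factor s and a non-causative factor z."""
    def __init__(self, in_dim=3 * 32 * 32, s_dim=64, z_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 256), nn.ReLU())
        self.to_s = nn.Linear(256, s_dim)  # label-causative factor
        self.to_z = nn.Linear(256, z_dim)  # label-non-causative factor

    def forward(self, x):
        h = self.backbone(x)
        return self.to_s(h), self.to_z(h)

class ConditionalDenoiser(nn.Module):
    """Toy stand-in for the conditional diffusion model: predicts noise from (x_t, t, s, z)."""
    def __init__(self, in_dim=3 * 32 * 32, s_dim=64, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + 1 + s_dim + z_dim, 512), nn.ReLU(),
            nn.Linear(512, in_dim),
        )

    def forward(self, x_t, t, s, z):
        h = torch.cat([x_t.flatten(1), t[:, None].float(), s, z], dim=1)
        return self.net(h)

def cib_step(x, y, enc, denoiser, clf, T=1000, lam=1.0):
    """One training step of a CIB-flavored objective (illustrative weighting):
    both factors assist generation, but only s may carry label information."""
    s, z = enc(x)
    t = torch.randint(0, T, (x.size(0),))
    noise = torch.randn_like(x)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T).clamp(min=1e-3)[:, None, None, None] ** 2
    x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * noise  # forward diffusion
    gen_loss = F.mse_loss(denoiser(x_t, t, s, z), noise.flatten(1))  # (s, z) -> data
    cls_loss = F.cross_entropy(clf(s), y)                            # s -> label
    return gen_loss + lam * cls_loss
```

At inference, the intent is to infer s for a (possibly attacked) input and classify from s alone, so the perturbation is absorbed into z rather than the prediction.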
Dual Manifold Adversarial Robustness: Defense against L_p and non-L_p Adversarial Attacks
Appendix A: OM-ImageNet Details, A.1 Overview
Figure 1: Visual comparison between original images and projected images.
All classification models are trained on two P6000 GPUs with a batch size of 64 for 20 epochs; the study examines how different training choices affect the robustness of the networks against unseen attacks.
Table 4: Classification accuracy against unseen attacks applied to the OM-ImageNet test set.
Table 5: Classification accuracy against known (PGD-50 and OM-PGD-50) and unseen attacks; brighter colors indicate larger absolute differences.
- Information Technology > Security & Privacy (0.51)
- Government > Military (0.41)
Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa, Jiachen T. Wang, Peng Gao, Charith Peris, Yao Ma, Rahul Gupta, Ming Jin, Prateek Mittal, Ruoxi Jia
Large language models remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Defending against novel jailbreaks represents a critical challenge in AI safety. Adversarial training -- designed to make models robust against worst-case perturbations -- has been the dominant paradigm for adversarial robustness. However, due to optimization challenges and difficulties in defining realistic threat models, adversarial training methods often fail on newly developed jailbreaks in practice. This paper proposes a new paradigm for improving robustness against unseen jailbreaks, centered on the Adversarial Déjà Vu hypothesis: novel jailbreaks are not fundamentally new, but largely recombinations of adversarial skills from previous attacks. We study this hypothesis through a large-scale analysis of 32 attack papers published over two years. Using an automated pipeline, we extract and compress adversarial skills into a sparse dictionary of primitives, with LLMs generating human-readable descriptions. Our analysis reveals that unseen attacks can be effectively explained as sparse compositions of earlier skills, with explanatory power increasing monotonically as skill coverage grows. Guided by this insight, we introduce Adversarial Skill Compositional Training (ASCoT), which trains on diverse compositions of skill primitives rather than isolated attack instances. ASCoT substantially improves robustness to unseen attacks, including multi-turn jailbreaks, while maintaining low over-refusal rates. We also demonstrate that expanding adversarial skill coverage, not just data scale, is key to defending against novel attacks. Warning: This paper contains content that may be harmful or offensive in nature.
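The "sparse composition of skills" analysis can be illustrated with standard sparse dictionary learning. This sketch is not the paper's pipeline: the paper extracts skills from attack papers via an automated LLM pipeline, whereas here random vectors stand in for attack-prompt embeddings, and the dictionary size and sparsity level are assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
seen_attacks = rng.normal(size=(200, 128))   # placeholder embeddings of known attacks
unseen_attack = rng.normal(size=(1, 128))    # placeholder embedding of a novel attack

# Learn a dictionary of "skill primitives" from seen attacks.
dico = DictionaryLearning(n_components=32, transform_algorithm="omp",
                          transform_n_nonzero_coefs=5, random_state=0)
dico.fit(seen_attacks)

# Explain the unseen attack as a sparse composition of learned primitives.
codes = sparse_encode(unseen_attack, dico.components_,
                      algorithm="omp", n_nonzero_coefs=5)
reconstruction = codes @ dico.components_
residual = np.linalg.norm(unseen_attack - reconstruction) / np.linalg.norm(unseen_attack)
print(f"active skills: {np.flatnonzero(codes)}  relative residual: {residual:.2f}")
```

A low residual under a small nonzero budget would correspond to the paper's finding that a novel attack is well explained by a few existing skill primitives; the monotonic-coverage claim amounts to this residual shrinking as the dictionary is fit on more attacks.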
- Information Technology > Security & Privacy (1.00)
- Government > Military (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Dual Manifold Adversarial Robustness: Defense against L_p and non-L_p Adversarial Attacks
In this work, we consider the scenario where the manifold information is exact, and show that this information can be very useful for improving robustness to novel attacks.
Reviewer comments and author responses: How can DMAT be exploited for standard tasks/datasets? Results are presented in the last column of Table B. PGD should not be viewed as the strongest attack for evaluation, and other strong baselines such as TRADES should be included in the main paper; results are shown in Table B. The notion of "manifold" should be clarified; we will explain this further in the paper.
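The "manifold" notion at issue can be made concrete with a sketch of an on-manifold attack: instead of perturbing raw pixels, perturb the latent code of a pretrained generator so the adversarial image stays on the learned image manifold. This is a hedged illustration, not DMAT itself; the generator `G`, latent `w`, and the L_inf budget in latent space are assumptions.

```python
import torch
import torch.nn.functional as F

def on_manifold_pgd(G, classifier, w, y, steps=10, alpha=0.01, eps=0.1):
    """PGD in latent space: maximize loss of classifier(G(w + delta)), ||delta||_inf <= eps."""
    for p in list(G.parameters()) + list(classifier.parameters()):
        p.requires_grad_(False)  # only the latent perturbation gets gradients
    delta = torch.zeros_like(w, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(classifier(G(w + delta)), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascent step on the latent perturbation
            delta.clamp_(-eps, eps)             # enforce the budget in latent space
            delta.grad.zero_()
    return G(w + delta).detach()                # an adversarial image on the manifold
```

Training against both this latent-space attack and standard pixel-space PGD is the "dual manifold" idea: robustness to L_p perturbations off the manifold and to non-L_p perturbations along it.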
Robustness Feature Adapter for Efficient Adversarial Training
Quanwei Wu, Jun Guo, Wei Wang, Yi Wang
Adversarial training (AT) with projected gradient descent is the most popular method for improving model robustness under adversarial attacks. However, the computational overhead becomes prohibitively large when AT is applied to large backbone models, and AT is also known to suffer from robust overfitting. This paper addresses both problems simultaneously, toward building more trustworthy foundation models. In particular, we propose a new adapter-based approach for efficient AT directly in the feature space. We show that the proposed adapter-based approach can improve inner-loop convergence quality by eliminating robust overfitting. As a result, it significantly increases computational efficiency and improves model accuracy by generalizing adversarial robustness to unseen attacks. We demonstrate the effectiveness of the new adapter-based approach across different backbone architectures and in AT at scale.
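As a rough sketch of what adapter-based AT on features could look like (not the paper's actual design): a frozen backbone, a small residual bottleneck adapter applied to its features, and a standard PGD inner loop, with only the adapter and head updated. The adapter shape, PGD hyperparameters, and the choice of input-space PGD are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RobustAdapter(nn.Module):
    """Bottleneck adapter with a residual connection, applied to backbone features."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, feat):
        return feat + self.up(F.relu(self.down(feat)))  # residual keeps clean features usable

def adapter_at_step(backbone, adapter, head, opt, x, y, eps=8/255, alpha=2/255, steps=10):
    """One AT step: craft PGD examples, then update only the adapter and head."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)  # frozen backbone is what makes the AT loop cheap

    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(head(adapter(backbone(x_adv))), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = torch.min(torch.max(x_adv.detach() + alpha * grad.sign(),
                                    x - eps), x + eps).clamp(0, 1)

    opt.zero_grad()
    F.cross_entropy(head(adapter(backbone(x_adv))), y).backward()
    opt.step()  # opt should hold only adapter and head parameters
```

Because gradients never update the backbone, each AT step touches only the small adapter and head, which is the source of the claimed efficiency at scale.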